A bank in your region wants to build a model to predict credit card defaulters more accurately in order to minimize losses. For this task we have a set of data on default payments, together with demographic data. The data is provided in the following CSV files:
TRAINING

train_customers.csv
- ID: ID of each client
- LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- SEX: Gender (1=male, 2=female)
- EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE: Marital status (1=married, 2=single, 3=others)
- AGE: Age in years

train_series.csv
- ID: ID of each client
- MONTH: The month to which the data refers
- PAY: Repayment status in the corresponding month (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, …, 8=payment delay for eight months, 9=payment delay for nine months and above)
- BILL_AMT: Amount of bill statement in the corresponding month (NT dollar)
- PAY_AMT: Amount of previous payment in the corresponding month (NT dollar)

train_target.csv
- DEFAULT_JULY: Default payment in July (1=yes, 0=no)

TEST (For PART 3)
- test_data.csv

SUBMISSION (For PART 3)
- submission_features.csv

BACKUP
- train_data.csv

Load all data and answer the following questions.
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import seaborn as sn
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
train_customers = pd.read_csv('train_customers.csv')
train_series = pd.read_csv('train_series.csv')
train_target = pd.read_csv('train_target.csv')
test_data = pd.read_csv('test_data.csv')
submission_features = pd.read_csv('submission_features.csv')
train_data = pd.read_csv('train_data.csv')
(2 points)
train_series.head()
| ID | MONTH | PAY | BILL_AMT | PAY_AMT | |
|---|---|---|---|---|---|
| 0 | 9910 | JUNE | 0.0 | 76885.0 | 3500.0 |
| 1 | 9910 | MAY | 0.0 | 79106.0 | 4000.0 |
| 2 | 9910 | APRIL | 0.0 | 81231.0 | 3000.0 |
| 3 | 9910 | MARCH | 0.0 | 81983.0 | 3146.0 |
| 4 | 9910 | FEBRUARY | 0.0 | 83773.0 | 3260.0 |
ts_pivot = train_series.pivot(index='ID', columns='MONTH', values=['PAY', 'BILL_AMT', 'PAY_AMT'])
ts_pivot.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in ts_pivot.columns]
ts_pivot.head()
| PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | ||||||||||||||||||
| 0 | 0.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 13600.0 | 44024.0 | 18697.0 | 28991.0 | 0.0 | 24391.0 | 0.0 | 680.0 | 10000.0 | 1300.0 | 22373.0 | 1000.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53169.0 | 50372.0 | 49470.0 | 75662.0 | 50875.0 | 70073.0 | 2000.0 | 1903.0 | 2006.0 | 3212.0 | 1603.0 | 2106.0 |
| 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 161487.0 | 168094.0 | 170922.0 | 155910.0 | 157577.0 | 158819.0 | 0.0 | 5500.0 | 1000.0 | 6800.0 | 13000.0 | 6500.0 |
| 3 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 0.0 | 0.0 | 0.0 | 23570.0 | 0.0 | 735.0 | 0.0 | 0.0 | 0.0 | 742.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9044.0 | 9417.0 | 9617.0 | 8214.0 | 9225.0 | 8034.0 | 331.0 | 356.0 | 330.0 | 1140.0 | 341.0 | 1150.0 |
(0.5 points)
train_customers.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | |
|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 |
tc_merged = pd.merge(train_customers, ts_pivot, on='ID', how='left')
tc_merged.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 |
(0.5 points)
train_target.head()
| ID | DEFAULT_JULY | |
|---|---|---|
| 0 | 9910 | 0 |
| 1 | 15561 | 1 |
| 2 | 23607 | 1 |
| 3 | 6314 | 0 |
| 4 | 27534 | 0 |
merged_data = pd.merge(tc_merged, train_target, on='ID', how='left')
merged_data.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 |
# This creates the csv used in submission
#merged_data.to_csv('Submissions/FordArturo_A_train.csv', index=False)
(0.5 points)
Per the data, it seems that most individuals who default also have higher education levels, which isn't necessarily intuitive at first glance. However, conclusions cannot be drawn from this histogram alone, since it doesn't show the proportion of non-defaulters for a proper comparison. It could simply be that more individuals with higher education are credit card users overall, inflating the counts in the histogram regardless of their proportion relative to non-defaulters.
defaulters = merged_data[merged_data.DEFAULT_JULY == 1]['EDUCATION']
defaulters_hist = px.histogram(defaulters, x='EDUCATION')
defaulters_hist
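Since raw counts can mislead here, one way to normalize the comparison is the default rate within each education level. A minimal sketch (the `demo` frame is illustrative; in the notebook you would group `merged_data` directly):

```python
import pandas as pd

# Illustrative stand-in; in the notebook, group merged_data directly.
demo = pd.DataFrame({
    'EDUCATION':    [1, 1, 2, 2, 2, 3, 3, 4],
    'DEFAULT_JULY': [1, 0, 1, 1, 0, 0, 0, 1],
})

# Default rate within each education level: the mean of a 0/1 target is the
# fraction of defaulters, which corrects for group sizes.
default_rate = demo.groupby('EDUCATION')['DEFAULT_JULY'].mean()
print(default_rate)
```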
(0.5 points)
The data indicates that those who default in July tend to have a lower credit limit (LIMIT_BAL) than their non-defaulting counterparts. However, this isn't a definitive indicator that a given person will default, as evidenced in the correlation matrix below. While the pattern is plausible, since lenders tend to be cautious about granting higher limits, the trend is not strong enough to be an accurate predictor on its own.
lim_bal_defaults = merged_data[['LIMIT_BAL', 'DEFAULT_JULY']]
lim_bal_hist = px.box(lim_bal_defaults, x='DEFAULT_JULY', y='LIMIT_BAL')
lim_bal_hist
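To quantify the gap the box plot shows, the median LIMIT_BAL per target class can be compared directly with a groupby. A sketch on illustrative data (in the notebook, `merged_data[['LIMIT_BAL', 'DEFAULT_JULY']]` would be used):

```python
import pandas as pd

# Illustrative stand-in; in the notebook, use merged_data[['LIMIT_BAL', 'DEFAULT_JULY']].
demo = pd.DataFrame({
    'LIMIT_BAL':    [50000, 60000, 200000, 300000, 80000, 250000],
    'DEFAULT_JULY': [1, 1, 0, 0, 1, 0],
})

# Median credit limit per target class makes the gap in the box plot explicit.
medians = demo.groupby('DEFAULT_JULY')['LIMIT_BAL'].median()
print(medians)
```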
(1 point)
full_df = merged_data.copy()
bins = [17, 25, 35, 45, 55, 65, 1000]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
full_df['AGE_BIN'] = pd.cut(full_df['AGE'], bins=bins, labels=labels)
full_df.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | AGE_BIN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 | 26-35 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 | 46-55 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 | 36-45 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 | 18-25 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 | 26-35 |
In this pie chart we explore the age ranges of the individuals who default in July. The data indicates that the vast majority of those who default (68.7% combined) fall into the 26-35 and 36-45 age groups. Given the large share, this could indicate a pattern between age and difficulty making payments. More insight could be gained by looking at ratios against the total number of customers in those categories.
#Insight 1: Age of Defaulters
# Count defaulters per age group so slice sizes reflect numbers of people, not summed ages.
defaulters_age = full_df[full_df.DEFAULT_JULY == 1]['AGE_BIN'].value_counts().reset_index()
defaulters_age.columns = ['AGE_BIN', 'COUNT']
chart1 = px.pie(defaulters_age, values='COUNT', names='AGE_BIN', title='Defaulters by age group')
chart1.show()
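The ratio comparison suggested above can be sketched as a default rate per age bin, which normalizes for how many customers fall into each group (illustrative data; in the notebook the same groupby would run on `full_df`):

```python
import pandas as pd

# Illustrative stand-in; in the notebook, the same groupby would run on full_df.
demo = pd.DataFrame({
    'AGE':          [23, 30, 33, 41, 52, 61],
    'DEFAULT_JULY': [0, 1, 1, 0, 1, 0],
})
bins = [17, 25, 35, 45, 55, 65, 1000]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
demo['AGE_BIN'] = pd.cut(demo['AGE'], bins=bins, labels=labels)

# Defaulters as a share of all customers in each age group, not just raw counts.
rate_by_age = demo.groupby('AGE_BIN', observed=False)['DEFAULT_JULY'].mean()
print(rate_by_age)
```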
By using this tree map, we can see how marital status affects the likelihood of a person defaulting. Looking at the resulting amounts within the boxes, we can see that marital status is almost evenly split in both the 'Not defaulted' and the 'Defaulted' groups. There are slightly more single non-defaulters than married non-defaulters, but the difference is not very significant. It seems that this is not a good indicator of the likelihood of defaulting.
#Insight 2: Marriage of Defaulters
full_df['DEF_LABEL'] = ['Defaulted' if i == 1 else 'Not defaulted' for i in full_df['DEFAULT_JULY']]
full_df['MARRIAGE_LABEL'] = ['Married' if i == 1 else 'Single' if i == 2 else 'Other' for i in full_df['MARRIAGE']]
chart2 = px.treemap(full_df, path=[px.Constant("Defaults in July"), 'DEF_LABEL', 'MARRIAGE_LABEL'])
chart2.show()
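The near-even split described above can be checked numerically with a row-normalized crosstab, showing the share of defaulters within each marital status (illustrative data; in the notebook, use the `MARRIAGE_LABEL` and `DEF_LABEL` columns of `full_df`):

```python
import pandas as pd

# Illustrative stand-in; in the notebook, use the columns of full_df.
demo = pd.DataFrame({
    'MARRIAGE_LABEL': ['Married', 'Single', 'Single', 'Married', 'Other', 'Single'],
    'DEF_LABEL': ['Defaulted', 'Defaulted', 'Not defaulted', 'Not defaulted',
                  'Defaulted', 'Not defaulted'],
})

# Row-normalized shares: within each marital status, the fraction that defaulted.
shares = pd.crosstab(demo['MARRIAGE_LABEL'], demo['DEF_LABEL'], normalize='index')
print(shares)
```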
The data shows a much larger share of people who pay duly or early among those who have attained higher education, in comparison to those in the late-pay categories. In the latter, we see roughly the same amounts for each group, which could mean that those who pay late do so regardless of their educational background. In contrast, those who do pay on time tend to have attained higher education.
#Insight 3: Payment status per Education group
filter_educ = full_df[full_df['EDUCATION'].isin([1,2,3,4])].copy()
filter_educ['EDU_LABEL'] = ['Graduate' if i == 1 else 'University' if i == 2 else 'High School' if i == 3 else 'Other' for i in filter_educ['EDUCATION']]
filter_educ['PAY_JUNE_LABEL'] = ['Early Pay' if i < 0 else 'Pay Duly' if i == 0 else 'Late 1 month' if i == 1 else 'Late 2 months' if i == 2 else 'Late >=3 months' for i in filter_educ['PAY_JUNE']]
filter_educ = filter_educ.sort_values(by=['PAY_JUNE'])
chart3 = px.histogram(filter_educ, x='PAY_JUNE_LABEL', color='EDU_LABEL')
chart3
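The per-category shares discussed above can be computed with a column-normalized crosstab: within each payment-status category, the fraction belonging to each education group (illustrative data; in the notebook, `filter_educ` would be used):

```python
import pandas as pd

# Illustrative stand-in; in the notebook, use filter_educ.
demo = pd.DataFrame({
    'EDU_LABEL': ['Graduate', 'University', 'Graduate', 'High School', 'University', 'Graduate'],
    'PAY_JUNE_LABEL': ['Pay Duly', 'Pay Duly', 'Early Pay', 'Late 1 month',
                       'Late 1 month', 'Pay Duly'],
})

# Column-normalized shares: within each payment-status category,
# the share of each education group.
edu_shares = pd.crosstab(demo['EDU_LABEL'], demo['PAY_JUNE_LABEL'], normalize='columns')
print(edu_shares)
```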
In this part, we're going to build a machine learning model to estimate the probability of default of the customers. For this part, you will need two additional datasets:
- test_data.csv - Contains new data from 9000 customers with the same structure as the training data from Part 1. It contains all the features and the target already joined. You can use this data set for testing purposes (or you can build your own test set from the training set). Either way, keep it, because you will need it for question C.2.
- submission_features.csv - Contains data associated with unseen customers, but without the target.
SOME CONSIDERATIONS
- Try several models to select the one best suited for this problem
- Don't forget about hyperparameter tuning to select the best configuration for the model
- Don't forget to check the different metrics in each CV round to build a robust model
to_train = merged_data.copy()
to_train.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 |
# This function returns 1 if the person in question has ever missed a payment, and 0 otherwise.
def has_missed_pay(to_train):
if ((to_train['PAY_JUNE'] > 0) or
(to_train['PAY_MAY'] > 0) or
(to_train['PAY_APRIL'] > 0) or
(to_train['PAY_MARCH'] > 0) or
(to_train['PAY_FEBRUARY'] > 0) or
(to_train['PAY_JANUARY'] > 0)):
return 1
else:
return 0
to_train['HAS_MISSED_PAY_BEFORE'] = to_train.apply(has_missed_pay, axis=1)
to_train.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | HAS_MISSED_PAY_BEFORE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 | 0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 | 0 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 | 1 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 | 0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 | 0 |
# This function counts the number of times a person has paid late before.
def times_missed_pay(to_train):
times = 0
for i in (to_train['PAY_JUNE'], to_train['PAY_MAY'], to_train['PAY_APRIL'],
to_train['PAY_MARCH'], to_train['PAY_FEBRUARY'], to_train['PAY_JANUARY']):
if i > 0:
times += 1
return times
to_train['TIMES_MISSED_PAY'] = to_train.apply(times_missed_pay, axis=1)
to_train.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | HAS_MISSED_PAY_BEFORE | TIMES_MISSED_PAY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 | 0 | 0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 | 0 | 0 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 | 1 | 1 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 | 0 | 0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 | 0 | 0 |
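Both engineered columns above can also be computed without a row-wise `apply`, which is usually much faster on 20,000 rows. A sketch on illustrative data (in the notebook, the same expressions would be applied to `to_train`):

```python
import pandas as pd

pay_cols = ['PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY']

# Illustrative stand-in; in the notebook, the same expressions apply to to_train.
demo = pd.DataFrame([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # never late
    [2.0, 0.0, -1.0, 1.0, 0.0, 0.0],  # late twice
], columns=pay_cols)

# A PAY value > 0 means that month's payment was late.
demo['HAS_MISSED_PAY_BEFORE'] = (demo[pay_cols] > 0).any(axis=1).astype(int)
demo['TIMES_MISSED_PAY'] = (demo[pay_cols] > 0).sum(axis=1)
print(demo[['HAS_MISSED_PAY_BEFORE', 'TIMES_MISSED_PAY']])
```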
# This function calculates the average fraction of the bill paid by the person in question each month.
# Since this requires both the month of the bill and the month of the payment, it only includes the
# payment ratio when there was an actual bill in a given month, to avoid division by zero. It also
# skips June's bill, since July hasn't been paid yet.
# This function runs, but its column is later dropped from the DataFrame due to its low score in the correlation matrix.
def perc_paid_on_avg(to_train):
months = {1:'JANUARY', 2:'FEBRUARY', 3:'MARCH', 4:'APRIL', 5:'MAY', 6:'JUNE'}
ratios = []
for i, j in months.items():
if i == 6:
break
if (to_train[f'BILL_AMT_{j}'] == 0):
continue
debt_paid_ratio = 1 - ((to_train[f'BILL_AMT_{j}'] - to_train[f'PAY_AMT_{months[i+1]}']) / to_train[f'BILL_AMT_{j}'])
ratios.append(debt_paid_ratio)
if len(ratios) == 0:
return -1
result = sum(ratios)/len(ratios)
return result
to_train['PERC_PAID_ON_AVG'] = to_train.apply(perc_paid_on_avg, axis=1)
to_train
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | HAS_MISSED_PAY_BEFORE | TIMES_MISSED_PAY | PERC_PAID_ON_AVG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 | 0 | 0 | 0.041150 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 | 0 | 0 | 1.031086 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 | 1 | 1 | 0.038339 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 | 0 | 0 | -0.041747 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 | 0 | 0 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 28636 | 330000.0 | 2 | 1 | 2 | 33 | -2.0 | 2.0 | 2.0 | 1.0 | -1.0 | -2.0 | -200.0 | 150.0 | 350.0 | -200.0 | 150.0 | -200.0 | 350.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 3 | 0.580952 |
| 19996 | 17730 | 50000.0 | 1 | 2 | 2 | 49 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 48551.0 | 18077.0 | 18815.0 | 59236.0 | 34656.0 | 53689.0 | 2047.0 | 1000.0 | 2000.0 | 2000.0 | 5618.0 | 2124.0 | 1 | 0 | 0 | 0.100799 |
| 19997 | 28030 | 410000.0 | 1 | 1 | 2 | 32 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 | 7764.0 | 4695.0 | 4133.0 | 3339.0 | 623.0 | 13047.0 | 623.0 | 5973.0 | 4421.0 | 10000.0 | 4695.0 | 7764.0 | 0 | 0 | 0 | 1.042331 |
| 19998 | 15725 | 200000.0 | 1 | 1 | 1 | 40 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 5074.0 | 5606.0 | 2534.0 | -7.0 | 19268.0 | 1765.0 | 19364.0 | 2547.0 | 33300.0 | 1772.0 | 5634.0 | 5099.0 | 0 | 0 | 0 | 1.004800 |
| 19999 | 19966 | 60000.0 | 1 | 2 | 2 | 25 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 15444.0 | 17598.0 | 17472.0 | 36090.0 | 13695.0 | 22083.0 | 1000.0 | 1000.0 | 603.0 | 2017.0 | 5000.0 | 5.0 | 1 | 1 | 3 | 0.101208 |
20000 rows × 28 columns
# This function returns the number of months that have passed since the last missed payment (0 if never missed).
def mo_since_last_miss(to_train):
months = {'JUNE':1, 'MAY':2, 'APRIL':3, 'MARCH':4, 'FEBRUARY':5, 'JANUARY':6}
smallest = 7
for i, j in months.items():
if (to_train[f'PAY_{i}'] > 0) and (j<smallest):
smallest = j
if smallest == 7:
smallest = 0
return smallest
to_train['MO_SINCE_LAST_MISS'] = to_train.apply(mo_since_last_miss, axis=1)
to_train
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_APRIL | PAY_FEBRUARY | PAY_JANUARY | PAY_JUNE | PAY_MARCH | PAY_MAY | BILL_AMT_APRIL | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | BILL_AMT_JUNE | BILL_AMT_MARCH | BILL_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MARCH | PAY_AMT_MAY | DEFAULT_JULY | HAS_MISSED_PAY_BEFORE | TIMES_MISSED_PAY | PERC_PAID_ON_AVG | MO_SINCE_LAST_MISS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9910 | 130000.0 | 2 | 2 | 1 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 81231.0 | 83773.0 | 85532.0 | 76885.0 | 81983.0 | 79106.0 | 3000.0 | 3260.0 | 3200.0 | 3500.0 | 3146.0 | 4000.0 | 0 | 0 | 0 | 0.041150 | 0 |
| 1 | 15561 | 60000.0 | 2 | 3 | 1 | 48 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 8422.0 | 3910.0 | 2431.0 | 4823.0 | 4377.0 | 5471.0 | 4377.0 | 2431.0 | 2120.0 | 5491.0 | 3918.0 | 9683.0 | 1 | 0 | 0 | 1.031086 | 0 |
| 2 | 23607 | 330000.0 | 2 | 1 | 2 | 44 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 253863.0 | 262753.0 | 268145.0 | 243621.0 | 257535.0 | 259517.0 | 9400.0 | 9766.0 | 9786.0 | 21400.0 | 9542.0 | 0.0 | 1 | 1 | 1 | 0.038339 | 3 |
| 3 | 6314 | 60000.0 | 2 | 2 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15069.0 | 5879.0 | -2879.0 | 29832.0 | 15761.0 | 18628.0 | 1000.0 | 1329.0 | 41378.0 | 1538.0 | 118.0 | 1308.0 | 0 | 0 | 0 | -0.041747 | 0 |
| 4 | 27534 | 180000.0 | 2 | 1 | 2 | 33 | -1.0 | -1.0 | -1.0 | -2.0 | -1.0 | -2.0 | 178.0 | 0.0 | 1118.0 | 4500.0 | 1500.0 | 2580.0 | 1500.0 | 1118.0 | 1331.0 | 2580.0 | 0.0 | 178.0 | 0 | 0 | 0 | 1.000000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 28636 | 330000.0 | 2 | 1 | 2 | 33 | -2.0 | 2.0 | 2.0 | 1.0 | -1.0 | -2.0 | -200.0 | 150.0 | 350.0 | -200.0 | 150.0 | -200.0 | 350.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 3 | 0.580952 | 1 |
| 19996 | 17730 | 50000.0 | 1 | 2 | 2 | 49 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 48551.0 | 18077.0 | 18815.0 | 59236.0 | 34656.0 | 53689.0 | 2047.0 | 1000.0 | 2000.0 | 2000.0 | 5618.0 | 2124.0 | 1 | 0 | 0 | 0.100799 | 0 |
| 19997 | 28030 | 410000.0 | 1 | 1 | 2 | 32 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 | 7764.0 | 4695.0 | 4133.0 | 3339.0 | 623.0 | 13047.0 | 623.0 | 5973.0 | 4421.0 | 10000.0 | 4695.0 | 7764.0 | 0 | 0 | 0 | 1.042331 | 0 |
| 19998 | 15725 | 200000.0 | 1 | 1 | 1 | 40 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 5074.0 | 5606.0 | 2534.0 | -7.0 | 19268.0 | 1765.0 | 19364.0 | 2547.0 | 33300.0 | 1772.0 | 5634.0 | 5099.0 | 0 | 0 | 0 | 1.004800 | 0 |
| 19999 | 19966 | 60000.0 | 1 | 2 | 2 | 25 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 15444.0 | 17598.0 | 17472.0 | 36090.0 | 13695.0 | 22083.0 | 1000.0 | 1000.0 | 603.0 | 2017.0 | 5000.0 | 5.0 | 1 | 1 | 3 | 0.101208 | 1 |
20000 rows × 29 columns
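MO_SINCE_LAST_MISS can likewise be vectorized: with the PAY columns ordered most-recent-first, `argmax` over a boolean mask finds the first late month. A sketch on illustrative data:

```python
import numpy as np
import pandas as pd

# Columns ordered from most recent (June) to oldest (January).
recent_first = ['PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY']

# Illustrative stand-in; in the notebook, build the mask from to_train.
demo = pd.DataFrame([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # never late -> 0
    [0.0, 0.0, 2.0, 0.0, 1.0, 0.0],  # most recent late payment was in April -> 3
], columns=recent_first)

late = demo[recent_first].gt(0).to_numpy()
# argmax finds the first True column (0-based); add 1 to count months ago, 0 when never late.
months_ago = np.where(late.any(axis=1), late.argmax(axis=1) + 1, 0)
print(months_ago)
```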
train_x_corr = to_train.drop(['ID', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY',], axis=1)
corr_matrix = train_x_corr.corr()
sn.heatmap(corr_matrix, annot=True)
plt.show()
# Here I removed all variables whose correlation with DEFAULT_JULY was between -0.1 and 0.1, as well as
# all PAY_*, BILL_AMT_*, and PAY_AMT_* fields, since the insights found in those were used to
# build the calculated fields used instead.
train_x = to_train.drop(['DEFAULT_JULY', 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PERC_PAID_ON_AVG', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1)
train_y = to_train['DEFAULT_JULY']
# Hold out 30% of the training data (fixed seed for reproducibility); the holdout is not used later,
# since test_data.csv serves as the test set.
X_train, X_holdout, y_train, y_holdout = train_test_split(train_x, train_y, test_size=0.3, random_state=73)
train_x
| LIMIT_BAL | HAS_MISSED_PAY_BEFORE | TIMES_MISSED_PAY | MO_SINCE_LAST_MISS | |
|---|---|---|---|---|
| 0 | 130000.0 | 0 | 0 | 0 |
| 1 | 60000.0 | 0 | 0 | 0 |
| 2 | 330000.0 | 1 | 1 | 3 |
| 3 | 60000.0 | 0 | 0 | 0 |
| 4 | 180000.0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 19995 | 330000.0 | 1 | 3 | 1 |
| 19996 | 50000.0 | 0 | 0 | 0 |
| 19997 | 410000.0 | 0 | 0 | 0 |
| 19998 | 200000.0 | 0 | 0 | 0 |
| 19999 | 60000.0 | 1 | 3 | 1 |
20000 rows × 4 columns
# Random Forest was selected as the algorithm best suited to this problem; the search space below is used for tuning.
parameters = dict(
max_depth=range(3, 40, 2),
min_samples_split=range(3, 40, 2),
min_samples_leaf=range(1, 40, 2)
)
clf = RandomizedSearchCV(
estimator=RandomForestClassifier(),
param_distributions=parameters,
n_iter=5,
scoring="roc_auc_ovr_weighted",
n_jobs=7,
return_train_score=True,
random_state = 73
)
clf.fit(X_train, y_train)
RandomizedSearchCV(estimator=RandomForestClassifier(), n_iter=5, n_jobs=7,
                   param_distributions={'max_depth': range(3, 40, 2),
                                        'min_samples_leaf': range(1, 40, 2),
                                        'min_samples_split': range(3, 40, 2)},
                   random_state=73, return_train_score=True,
                   scoring='roc_auc_ovr_weighted')
print(clf.best_params_)
print(clf.best_score_)
{'min_samples_split': 29, 'min_samples_leaf': 15, 'max_depth': 3}
0.7459157990036696
results = pd.DataFrame(clf.cv_results_)
# Analyze results
test_results = results.filter(regex=r"split\d_test|mean_test|std_test|param_")
test_results["CoV"] = test_results["std_test_score"] / test_results["mean_test_score"] * 100
test_results.sort_values("mean_test_score", ascending=False)
| param_min_samples_split | param_min_samples_leaf | param_max_depth | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | CoV | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 15 | 3 | 0.728920 | 0.742342 | 0.760511 | 0.739772 | 0.758035 | 0.745916 | 0.011826 | 1.585441 |
| 3 | 31 | 1 | 5 | 0.728218 | 0.743847 | 0.761446 | 0.737782 | 0.754851 | 0.745229 | 0.011852 | 1.590358 |
| 2 | 13 | 23 | 23 | 0.726265 | 0.742583 | 0.758756 | 0.728890 | 0.750758 | 0.741451 | 0.012456 | 1.679942 |
| 4 | 7 | 21 | 31 | 0.725918 | 0.741037 | 0.758529 | 0.728081 | 0.751571 | 0.741028 | 0.012754 | 1.721166 |
| 0 | 37 | 7 | 29 | 0.722333 | 0.738835 | 0.755537 | 0.722627 | 0.747986 | 0.737464 | 0.013329 | 1.807412 |
# Analyze results
train_results = results.filter(regex=r"split\d_train|mean_train|std_train|param_")
train_results["CoV"] = train_results["std_train_score"] / train_results["mean_train_score"] * 100
train_results.sort_values("mean_train_score", ascending=False)
| param_min_samples_split | param_min_samples_leaf | param_max_depth | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score | CoV | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 7 | 29 | 0.768684 | 0.764073 | 0.759709 | 0.767434 | 0.762267 | 0.764434 | 0.003293 | 0.430757 |
| 4 | 7 | 21 | 31 | 0.764165 | 0.760130 | 0.755074 | 0.763027 | 0.757524 | 0.759984 | 0.003374 | 0.443955 |
| 2 | 13 | 23 | 23 | 0.763914 | 0.759239 | 0.754816 | 0.762380 | 0.757087 | 0.759487 | 0.003335 | 0.439071 |
| 3 | 31 | 1 | 5 | 0.759369 | 0.754846 | 0.749555 | 0.757673 | 0.751587 | 0.754606 | 0.003655 | 0.484300 |
| 1 | 29 | 15 | 3 | 0.752203 | 0.748540 | 0.743673 | 0.749820 | 0.744567 | 0.747761 | 0.003209 | 0.429153 |
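Per the considerations for this part, several metrics can be inspected in each CV round in one pass with `cross_validate`. A sketch on a synthetic stand-in dataset (in the notebook, `train_x` and `train_y` would be passed instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in; in the notebook, pass train_x and train_y instead.
X, y = make_classification(n_samples=300, n_features=4, random_state=73)

# One CV pass that records several metrics per fold, as the considerations suggest.
scores = cross_validate(
    RandomForestClassifier(random_state=73),
    X, y, cv=5,
    scoring=['roc_auc', 'accuracy', 'f1'],
)
print(scores['test_roc_auc'].mean())
```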
# Here I added the calculated fields above to the test set and removed the unused columns as well.
test_data['HAS_MISSED_PAY_BEFORE'] = test_data.apply(has_missed_pay, axis=1)
test_data['TIMES_MISSED_PAY'] = test_data.apply(times_missed_pay, axis=1)
test_data['PERC_PAID_ON_AVG'] = test_data.apply(perc_paid_on_avg, axis=1)
test_data['MO_SINCE_LAST_MISS'] = test_data.apply(mo_since_last_miss, axis=1)
test_x = test_data.drop(['DEFAULT_JULY', 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PERC_PAID_ON_AVG', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1)
test_x = test_x[clf.feature_names_in_]  # align the column order with the features the model was fit on
proba = clf.predict_proba(test_x)
pred = clf.predict(test_x)
proba = proba[:,1]
test_y = test_data['DEFAULT_JULY']
auc_score = roc_auc_score(test_y, proba, average=None)
avg_auc_score = roc_auc_score(test_y, proba, average="weighted")
print(f"AUC Class: {round(auc_score,3)}")
print(f"Avg. AUC (Weighted): {round(avg_auc_score,3)}")
AUC Class: 0.756 Avg. AUC (Weighted): 0.756
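Note that `roc_auc_score` must be given the positive-class probabilities, not the hard 0/1 predictions; AUC measures how well the scores rank positives above negatives. A minimal sketch on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, y_proba))
```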
print(classification_report(test_y, pred))
precision recall f1-score support
0 0.83 0.95 0.89 7005
1 0.64 0.32 0.43 1995
accuracy 0.81 9000
macro avg 0.74 0.63 0.66 9000
weighted avg 0.79 0.81 0.78 9000
fig = go.Figure()
proba_all = clf.predict_proba(test_x)  # per-class probabilities
for idx, i in enumerate(clf.classes_):
    y_true = (test_y == i).astype(int).values
    y_score = proba_all[:, idx]  # probability of class i, so each curve gets its own AUC
    fpr, tpr, _ = roc_curve(y_true, y_score)
    name = f"Class {i} (AUC={roc_auc_score(y_true, y_score):.3f})"
    fig.add_trace(go.Scatter(x=fpr, y=tpr, name=name, mode='lines', line=dict(width=3)))
fig.update_layout(
title="ROC for each Class",
xaxis_title='False Positive Rate',
yaxis_title='True Positive Rate',
template="none",
yaxis=dict(scaleanchor="x", scaleratio=0.5),
width=1200, height=600
)
fig.show()
(1.5 points)
sub = submission_features.copy()
sub = sub.set_index('ID')
sub
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_JUNE | PAY_MAY | PAY_APRIL | PAY_MARCH | PAY_FEBRUARY | PAY_JANUARY | BILL_AMT_JUNE | BILL_AMT_MAY | BILL_AMT_APRIL | BILL_AMT_MARCH | BILL_AMT_FEBRUARY | BILL_AMT_JANUARY | PAY_AMT_JUNE | PAY_AMT_MAY | PAY_AMT_APRIL | PAY_AMT_MARCH | PAY_AMT_FEBRUARY | PAY_AMT_JANUARY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2774 | 70000.0 | 2 | 3 | 2 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 52899.0 | 45758.0 | 37858.0 | 29660.0 | 29519.0 | 29457.0 | 1726.0 | 3028.0 | 1026.0 | 1195.0 | 1060.0 | 1031.0 |
| 15339 | 100000.0 | 2 | 2 | 1 | 38 | 1 | 2 | 0 | 0 | 2 | 0 | 14483.0 | 13961.0 | 15323.0 | 16268.0 | 15868.0 | 16448.0 | 0.0 | 1600.0 | 1500.0 | 0.0 | 1000.0 | 1500.0 |
| 26485 | 140000.0 | 2 | 1 | 2 | 40 | 1 | 2 | 0 | 0 | 0 | 0 | 143478.0 | 135633.0 | 128285.0 | 123074.0 | 97382.0 | 99364.0 | 0.0 | 6601.0 | 5522.0 | 3200.0 | 3300.0 | 3500.0 |
| 6657 | 380000.0 | 2 | 2 | 2 | 27 | 0 | 0 | -1 | 0 | 0 | 0 | 4400.0 | 0.0 | 270219.0 | 3641.0 | 3717.0 | 3791.0 | 0.0 | 270219.0 | 131.0 | 135.0 | 136.0 | 76.0 |
| 7917 | 100000.0 | 1 | 2 | 2 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 32186.0 | 32949.0 | 33680.0 | 34419.0 | 37782.0 | 76185.0 | 2000.0 | 2000.0 | 2000.0 | 4500.0 | 40000.0 | 3000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2501 | 280000.0 | 1 | 1 | 2 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 139719.0 | 136052.0 | 118076.0 | 101457.0 | 81124.0 | 42347.0 | 6000.0 | 4000.0 | 4000.0 | 4000.0 | 3000.0 | 1000.0 |
| 676 | 180000.0 | 1 | 2 | 1 | 48 | -1 | -1 | -1 | -1 | -1 | -1 | 1294.0 | 1294.0 | 1466.0 | 1294.0 | 2324.0 | 264.0 | 1294.0 | 1466.0 | 1294.0 | 2324.0 | 264.0 | 264.0 |
| 26934 | 90000.0 | 2 | 2 | 1 | 32 | 1 | -1 | -1 | -1 | -1 | -1 | 0.0 | 323.0 | 0.0 | 2520.0 | 1651.0 | 0.0 | 323.0 | 0.0 | 2520.0 | 1651.0 | 0.0 | 0.0 |
| 27956 | 60000.0 | 2 | 2 | 2 | 28 | 2 | 2 | 2 | 2 | 2 | 2 | 48322.0 | 51587.0 | 52664.0 | 53059.0 | 51468.0 | 54511.0 | 4692.0 | 2500.0 | 2000.0 | 0.0 | 4019.0 | 0.0 |
| 6298 | 20000.0 | 2 | 1 | 2 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 15036.0 | 16069.0 | 16774.0 | 17266.0 | 17539.0 | 18033.0 | 1580.0 | 1280.0 | 770.0 | 700.0 | 785.0 | 600.0 |
1000 rows × 23 columns
sub = submission_features.copy()
sub['HAS_MISSED_PAY_BEFORE'] = sub.apply(has_missed_pay, axis=1)
sub['TIMES_MISSED_PAY'] = sub.apply(times_missed_pay, axis=1)
sub['MO_SINCE_LAST_MISS'] = sub.apply(mo_since_last_miss, axis=1)
sub = sub.drop(['ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1)
sub_proba = clf.predict_proba(sub)
sub_proba = sub_proba[:,1]
sub_proba = pd.Series(sub_proba, name='PROBABILITY')
submission = pd.merge(submission_features.copy(), sub_proba, right_index=True, left_index=True)
submission = submission[['ID', 'PROBABILITY']]
# Submission dataset probabilities:
submission
| | ID | PROBABILITY |
|---|---|---|
| 0 | 2774 | 0.143892 |
| 1 | 15339 | 0.555450 |
| 2 | 26485 | 0.439663 |
| 3 | 6657 | 0.094969 |
| 4 | 7917 | 0.114265 |
| ... | ... | ... |
| 995 | 2501 | 0.103239 |
| 996 | 676 | 0.106378 |
| 997 | 26934 | 0.405543 |
| 998 | 27956 | 0.647503 |
| 999 | 6298 | 0.174278 |
1000 rows × 2 columns
# This creates the csv used in submission
#submission.to_csv('Submissions/FordArturo_C1_submission.csv', index=False)
(1 point)
submission = submission.sort_values('PROBABILITY', axis=0)
# Customers with the lowest probability of defaulting:
submission.head(5)
| | ID | PROBABILITY |
|---|---|---|
| 584 | 3237 | 0.087383 |
| 816 | 21358 | 0.087383 |
| 383 | 18523 | 0.087383 |
| 565 | 15141 | 0.087383 |
| 691 | 24668 | 0.091440 |
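The sort-then-head pattern above can also be written with `DataFrame.nsmallest`, which pulls the k lowest values directly. A sketch on a small hypothetical frame (the IDs and probabilities here are illustrative only):

```python
import pandas as pd

# hypothetical submission-style frame
demo = pd.DataFrame({'ID': [3237, 21358, 24668],
                     'PROBABILITY': [0.087, 0.087, 0.091]})

# two customers with the lowest default probability; ties keep first occurrence
lowest = demo.nsmallest(2, 'PROBABILITY')
print(lowest['ID'].tolist())  # [3237, 21358]
```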
(2.5 points)
# Find the optimal probability cutoff from the ROC curve by maximizing Youden's J statistic (TPR - FPR).
false_pos_rate, true_pos_rate, probab = roc_curve(test_y, proba)
optimal_proba_cutoff = sorted(list(zip(np.abs(true_pos_rate - false_pos_rate), probab)), key=lambda i: i[0], reverse=True)[0][1]
roc_predictions = [1 if i >= optimal_proba_cutoff else 0 for i in proba]
optimal_proba_cutoff
0.25904195865470914
# Some accuracy statistics for evaluating different models
print("Accuracy Score Before and After Thresholding: {}, {}".format(accuracy_score(test_y, pred), accuracy_score(test_y, roc_predictions)))
print("Precision Score Before and After Thresholding: {}, {}".format(precision_score(test_y, pred), precision_score(test_y, roc_predictions)))
print("Recall Score Before and After Thresholding: {}, {}".format(recall_score(test_y, pred), recall_score(test_y, roc_predictions)))
print("F1 Score Before and After Thresholding: {}, {}".format(f1_score(test_y, pred), f1_score(test_y, roc_predictions)))
Accuracy Score Before and After Thresholding: 0.8095555555555556, 0.7406666666666667
Precision Score Before and After Thresholding: 0.6403596403596403, 0.44240570846075433
Recall Score Before and After Thresholding: 0.3213032581453634, 0.6526315789473685
F1 Score Before and After Thresholding: 0.4279038718291055, 0.5273390036452006
# Confusion matrix used to check the accuracy of the thresholded predictions.
y_actual = pd.Series(test_y, name='Actual')  # compare against the test labels, not the training labels
y_actual = y_actual.reset_index(drop=True)
y_predict_tf = pd.Series(roc_predictions, name='Predicted')
df_confusion = pd.crosstab(y_actual, y_predict_tf, rownames=['Actual'], colnames=['Predicted'], margins=True)
print (df_confusion)
Predicted     0     1   All
Actual
0          4736  2275  7011
1          1321   668  1989
All        6057  2943  9000
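`pd.crosstab` with `margins=True` is what builds the confusion table, tallying actual/predicted pairs and appending the "All" totals. A minimal sketch on hypothetical labels:

```python
import pandas as pd

# toy actual labels and thresholded predictions
actual = pd.Series([0, 0, 1, 1, 1], name='Actual')
predicted = pd.Series([0, 1, 1, 1, 0], name='Predicted')

# rows = actual class, columns = predicted class, plus "All" margins
cm = pd.crosstab(actual, predicted, margins=True)
print(cm)
```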